Key Word(s): matplotlib, seaborn, plots, pandas




CS109A Introduction to Data Science

Lab 5: Exploratory Data Analysis, seaborn, more Plotting

Harvard University
Fall 2019
Instructors: Pavlos Protopapas, Kevin Rader, and Chris Tanner
Material Preparation: Eleni Kaxiras.


In [1]:
#RUN THIS CELL
import requests
from IPython.display import HTML
styles = requests.get("https://raw.githubusercontent.com/Harvard-IACS/2018-CS109A/master/content/styles/cs109.css").text
HTML(styles)
In [2]:
# import the necessary libraries
%matplotlib inline
import numpy as np
import scipy as sp
import matplotlib as mpl
import matplotlib.cm as cm
import matplotlib.pyplot as plt
import pandas as pd
import time
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 200)
pd.set_option('display.notebook_repr_html', True)
import seaborn as sns

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

import warnings
warnings.filterwarnings('ignore')
%config InlineBackend.figure_format ='retina'
In [3]:
%%javascript
IPython.OutputArea.auto_scroll_threshold = 9999;

Learning Goals

By the end of this lab, you should be able to:

  • implement the different types of plots mentioned in class, such as histograms and boxplots.
  • use seaborn as well as matplotlib as part of your plotting toolbox.

This lab corresponds to lectures 6 through 9 and maps to homework 3.


1 - Visualization Inspiration

[Figure: "Summers Are Getting Hotter" infographic]

source: nytimes.org

Notice that in “Summers Are Getting Hotter,” above, the histogram has intervals for global summer temperatures on the x-axis, designated from extremely cold to extremely hot, and their frequency on the y-axis.

That was an infographic intended for the general public. In contrast, take a look at the plots below of the same data, published in a scientific journal. They look quite different, don't they?

[Figure: plots of the same data from the journal article]


James Hansen, Makiko Sato, and Reto Ruedy, Perception of climate change. PNAS


2 - Implementing Various Types of Plots using matplotlib and seaborn.

Before you start coding your visualization, you need to decide what type of visualization to use: a box plot, a histogram, a scatter plot, or something else? That will depend on the purpose of the plot (is it for inspecting your data during EDA, or for showing your results and conclusions to other people?) and on the number of variables that you want to plot.

You have a lot of tools for plotting in Python. The basic one, of course, is matplotlib. seaborn is built on top of it, while other libraries, such as bokeh and altair, use their own rendering back ends.

In this class we will continue using matplotlib and also look into seaborn. Those two libraries are the ones you should be using for homework.

Introduction to seaborn

Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics. The library provides a database of useful datasets for educational purposes that can be loaded by typing:

seaborn.load_dataset(name, cache=True, data_home=None, **kws)

For information on what these datasets are, see https://github.com/mwaskom/seaborn-data

The plotting functions in seaborn can be divided into two categories:

  • 'axes-level' functions, such as regplot, boxplot, kdeplot, scatterplot, distplot which can connect with the matplotlib Axes object and its parameters. You can use that object as you would in matplotlib:

    f, (ax1, ax2) = plt.subplots(2)
    sns.regplot(x=x, y=y, ax=ax1)
    sns.kdeplot(x, ax=ax2)
    ax1 = sns.distplot(x, kde=False, bins=20)
    
  • 'figure-level' functions, such as lmplot, catplot (called factorplot before seaborn 0.9), jointplot, relplot, pairplot. In this case, seaborn organizes the resulting plot, which may include several Axes, in a meaningful way. That means that the functions need to have total control over the figure, so it isn't possible to plot, say, an lmplot onto a figure that already exists. Calling the function always initializes a figure and sets it up for the specific plot it's drawing. These functions return a grid object (a FacetGrid for lmplot, catplot and relplot, a JointGrid for jointplot, a PairGrid for pairplot) with its own methods for operating on the resulting plot.

To set the parameters for figure-level functions:

sns.set_context("notebook", font_scale=1, rc={"lines.linewidth": 2.5})

The Titanic dataset

The titanic dataset from seaborn contains data for 891 passengers on the Titanic. Each row represents one person. The columns describe different attributes of the person, including whether they survived, their age, their on-board class, their sex, and the fare they paid.

In [4]:
titanic = sns.load_dataset('titanic');
titanic.info();

RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
survived       891 non-null int64
pclass         891 non-null int64
sex            891 non-null object
age            714 non-null float64
sibsp          891 non-null int64
parch          891 non-null int64
fare           891 non-null float64
embarked       889 non-null object
class          891 non-null category
who            891 non-null object
adult_male     891 non-null bool
deck           203 non-null category
embark_town    889 non-null object
alive          891 non-null object
alone          891 non-null bool
dtypes: bool(2), category(2), float64(2), int64(4), object(5)
memory usage: 80.6+ KB
In [5]:
titanic.columns
Out[5]:
Index(['survived', 'pclass', 'sex', 'age', 'sibsp', 'parch', 'fare', 'embarked', 'class', 'who', 'adult_male', 'deck', 'embark_town', 'alive', 'alone'], dtype='object')
Exercise: Drop the following features:

'embarked', 'who', 'adult_male', 'embark_town', 'alive', 'alone'

In [6]:
# your code here
columns = ['embarked', 'who', 'adult_male', 'embark_town', 'alive', 'alone']
titanic = titanic.drop(columns=columns)
titanic
Out[6]:
     survived  pclass     sex   age  sibsp  parch     fare   class deck
0           0       3    male  22.0      1      0   7.2500   Third  NaN
1           1       1  female  38.0      1      0  71.2833   First    C
2           1       3  female  26.0      0      0   7.9250   Third  NaN
3           1       1  female  35.0      1      0  53.1000   First    C
4           0       3    male  35.0      0      0   8.0500   Third  NaN
...       ...     ...     ...   ...    ...    ...      ...     ...  ...
886         0       2    male  27.0      0      0  13.0000  Second  NaN
887         1       1  female  19.0      0      0  30.0000   First    B
888         0       3  female   NaN      1      2  23.4500   Third  NaN
889         1       1    male  26.0      0      0  30.0000   First    C
890         0       3    male  32.0      0      0   7.7500   Third  NaN

891 rows × 9 columns

Exercise: Find how many passengers have no deck information.
In [7]:
# your code here
# isna() flags the missing entries; summing the flags counts them
missing_decks = int(titanic['deck'].isna().sum())
missing_decks
Out[7]:
688
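
The same idea extends to every column at once. The sketch below uses a small made-up table (`toy` is hypothetical, not the Titanic data) to show how `isna()` combined with `sum()` or `mean()` gives missing-value counts and fractions.

```python
import numpy as np
import pandas as pd

# Small made-up table (not the Titanic data) with some missing entries
toy = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, 35.0],
    "deck": ["C", None, None, "E"],
    "fare": [7.25, 71.28, 7.92, 53.10],
})

# isna() marks missing cells; summing the booleans counts them per column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# The mean of the booleans is the fraction of rows with a missing deck value
missing_deck_fraction = toy["deck"].isna().mean()
print(missing_deck_fraction)
```

Applied to the lab's data, `titanic.isna().sum()` should report the same gaps that `titanic.info()` shows indirectly through its non-null counts.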

Histograms

Plotting one variable's distribution (categorical and continuous)

The most convenient way to take a quick look at a univariate distribution in seaborn is the distplot() function. By default, this will draw a histogram and fit a kernel density estimate (KDE). (Starting with seaborn 0.11, distplot() is deprecated in favor of histplot() and displot().)

A histogram displays a quantitative (numerical) distribution by showing the number (or percentage) of the data values that fall in specified intervals. The intervals are on the x-axis, and the number of values falling in each interval, shown as either a count or a percentage, is represented by a bar drawn above the corresponding interval.
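
To make the definition concrete, the sketch below computes the intervals and counts behind a histogram directly with NumPy. The six values are made up for illustration.

```python
import numpy as np

values = np.array([1, 2, 2, 3, 8, 9])  # made-up observations

# Three equal-width intervals covering 0 to 9
counts, edges = np.histogram(values, bins=3, range=(0, 9))
print(edges)   # interval boundaries: 0, 3, 6, 9
print(counts)  # number of observations per interval: 3, 1, 2

# Dividing by the total gives the share of values in each interval
percentages = 100 * counts / counts.sum()
print(percentages)
```

Each interval includes its left edge and excludes its right edge, except the last one, which includes both. That is why the value 3 is counted in the second interval.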

In [9]:
# What was the age distribution among passengers in the Titanic?
import seaborn as sns
sns.set(color_codes=True)

f, ax = plt.subplots(1,1, figsize=(8, 3));
ax = sns.distplot(titanic.age, kde=False, bins=20)

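# Caution: chaining .set(...) onto the call, as in the commented line below,
# would store the return value of .set() in ax instead of the Axes object
#ax = sns.distplot(titanic.age, kde=False, bins=20).set(xlim=(0, 90));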

ax.set(xlim=(0, 90));
ax.set_ylabel('counts');
In [10]:
f, ax = plt.subplots(1,1, figsize=(8, 3))
ax.hist(titanic.age.dropna(), bins=20);  # drop missing ages before binning
ax.set_xlim(0,90);
Exercise (pandas trick): Count all the infants on board (age less than 3) and all the children aged 3 up to, but not including, 10.
In [11]:
# your code here
infants = len(titanic[(titanic.age < 3)]) 
children = len(titanic[(titanic.age >= 3) & (titanic.age < 10)]) 
print(f'There were {infants} infants and {children} children on board the Titanic')
There were 24 infants and 38 children on board the Titanic

Pandas trick: We want to create virtual "bins" for readability and replace ranges of values with categories.

We will do this in an ad hoc way, although it can be done better. For example, in the previous plot we could set:

  • (age<3) = 'infants',
  • (3,
  • (18
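
One less ad hoc way to replace ranges with categories is `pd.cut`. In the sketch below the ages are made up, and the bin edges and the labels other than the infant cutoff at 3 are hypothetical choices for illustration, since the list above does not spell them out.

```python
import numpy as np
import pandas as pd

ages = pd.Series([1, 5, 17, 30, 64, np.nan])  # made-up ages

# Hypothetical bin edges and labels; right=False makes each bin [low, high)
bin_edges = [0, 3, 18, 120]
labels = ["infant", "child", "adult"]
age_group = pd.cut(ages, bins=bin_edges, labels=labels, right=False)

print(age_group.tolist())
print(age_group.value_counts())
```

Missing ages stay missing in the result, so they do not end up in any category by accident.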

See matplotlib colors here.

In [12]:
# set the colors
cmap = plt.get_cmap('Pastel1')
young = cmap(0.5)
middle = cmap(0.2)
older = cmap(0.8)

# get the object we will change - patches is an array with len: num of bins
fig, ax = plt.subplots()
y_values, bins, patches = ax.hist(titanic.age.dropna(), 10)

# color the bars by age group, using plain loops instead of list comprehensions
for patch in patches[0:1]:   # bin 0
    patch.set_facecolor(young)
for patch in patches[1:3]:   # bins 1 and 2
    patch.set_facecolor(middle)
for patch in patches[3:10]:  # 7 remaining bins
    patch.set_facecolor(older)

ax.grid(True)
fig.show()

Kernel Density Estimation

The kernel density estimate can be a useful tool for plotting the shape of a distribution. The bandwidth (bw) parameter of the KDE controls how tightly the estimate is fit to the data, much like the bin size in a histogram. It corresponds to the width of the kernel that is placed on each observation. The default behavior tries to guess a good value using a common reference rule, but it may be helpful to try larger or smaller values.
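
The effect of the bandwidth can be sketched without any plotting, using SciPy's `gaussian_kde` on a synthetic two-peaked sample. The sample and the two bandwidth factors are assumptions for illustration. The meaning of seaborn's own bandwidth argument has varied between versions, so this sketch calls SciPy directly.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Synthetic bimodal sample (an assumption for illustration)
rng = np.random.default_rng(0)
sample = np.concatenate([rng.normal(-2, 0.5, 200), rng.normal(2, 0.5, 200)])

grid = np.linspace(-5, 5, 201)

# In SciPy, a scalar bw_method is a factor that multiplies the sample's
# standard deviation to give the kernel width
narrow = gaussian_kde(sample, bw_method=0.1)(grid)
wide = gaussian_kde(sample, bw_method=1.0)(grid)

# The narrow kernel keeps the two peaks; the wide kernel fills in the gap
center = len(grid) // 2  # index of x = 0, halfway between the peaks
print(narrow.max(), wide.max())
print(narrow[center], wide[center])
```

A small bandwidth follows the data closely and preserves both peaks, while a large one smooths them into a single broad bump. This is the same trade-off as choosing many narrow bins or a few wide ones in a histogram.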

In [13]:
sns.kdeplot(titanic.age, bw=0.6, label="bw: 0.6", shade=True, color="r");
sns.kdeplot(titanic.age, bw=2, label="bw: 2", shade=True);
Exercise: Plot the distribution of fare paid by passengers
In [14]:
# your code here
sns.kdeplot(titanic.fare, bw=0.5, label="bw: 0.5", shade=True);

You can mix elements of matplotlib, such as Axes, with seaborn elements to get the best of both worlds.

In [15]:
import seaborn as sns
sns.set(color_codes=True)

x1 = np.random.normal(size=100)
x2 = np.random.normal(size=100)

fig, ax = plt.subplots(1,2, figsize=(15,5))

# seaborn goes in first subplot
sns.set(font_scale=0.5)
sns.distplot(x1, kde=False, bins=15, ax=ax[0]);
sns.distplot(x2, kde=False, bins=15, ax=ax[0]);
ax[0].set_title('seaborn Graph Here', fontsize=14)
ax[0].set_xlabel(r'$x$', fontsize=14)
ax[0].set_ylabel(r'$count$', fontsize=14)

# matplotlib goes in second subplot
ax[1].hist(x1, alpha=0.2, bins=15, label=r'$x1$');
ax[1].hist(x2, alpha=0.5, bins=15, label=r'$x2$');
ax[1].set_xlabel(r'$x$', fontsize=14)
ax[1].set_ylabel(r'$count$', fontsize=14)
ax[1].set_title('matplotlib Graph Here', fontsize=14)
ax[1].legend(loc='best', fontsize=14);

Introducing the heart disease dataset.

More on this in the in-class exercise at the end of the notebook.

In [16]:
columns = ["age", "sex", "cp", "restbp", "chol", "fbs", "restecg", 
           "thalach", "exang", "oldpeak", "slope", "ca", "thal", "num"]
heart_df = pd.read_csv('../data/heart_disease.csv', header=None, names=columns)

heart_df.head()
Out[16]:
    age  sex   cp  restbp   chol  fbs  restecg  thalach  exang  oldpeak  slope   ca  thal  num
0  63.0  1.0  1.0   145.0  233.0  1.0      2.0    150.0    0.0      2.3    3.0  0.0   6.0  0.0
1  67.0  1.0  4.0   160.0  286.0  0.0      2.0    108.0    1.0      1.5    2.0  3.0   3.0  2.0
2  67.0  1.0  4.0   120.0  229.0  0.0      2.0    129.0    1.0      2.6    2.0  2.0   7.0  1.0
3  37.0  1.0  3.0   130.0  250.0  0.0      0.0    187.0    0.0      3.5    3.0  0.0   3.0  0.0
4  41.0  0.0  2.0   130.0  204.0  0.0      2.0    172.0    0.0      1.4    1.0  0.0   3.0  0.0

Boxplots

One variable.

In [17]:
# seaborn
ax = sns.boxplot(x='age', data=titanic)
#ax = sns.boxplot(x=titanic['age']) # another way to write this
ax.set_ylabel(None);
ax.set_xlabel('age', fontsize=14);
ax.set_title('Distribution of age in the Titanic', fontsize=14);

Two variables

Exercise: Did more young people or older ones get first class tickets on the Titanic?
In [18]:
# your code here
# two variables seaborn
ax = sns.boxplot(x="class", y="age", data=titanic)
In [19]:
# two variable boxplot in pandas
titanic.boxplot('age',by='class')
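
A grouped boxplot can be cross-checked with numbers. The sketch below uses a made-up table (`toy` is hypothetical, not the real Titanic records) to compute, per class, the median and quartiles that a boxplot draws.

```python
import pandas as pd

# Made-up passengers (not the real Titanic records)
toy = pd.DataFrame({
    "class": ["First", "First", "First", "Third", "Third", "Third"],
    "age": [38.0, 54.0, 45.0, 22.0, 4.0, 27.0],
})

# The line inside each box of a boxplot is the group's median
median_age = toy.groupby("class")["age"].median()
# describe() also reports the quartiles that form the box edges
summary = toy.groupby("class")["age"].describe()
print(median_age)
print(summary[["25%", "50%", "75%"]])
```

Running the same two calls on `titanic` gives a numeric answer to the exercise above that can be compared against the plots.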

Scatterplots

Plotting the distribution of two variables

A scatterplot shows a bivariate distribution: each observation is drawn as a point at its x and y values. You can draw one with the matplotlib plt.scatter function, or with the seaborn scatterplot() and jointplot() functions:

In [20]:
f, ax = plt.subplots(1,1, figsize=(10, 5))
sns.scatterplot(x="fare", y="age", data=titanic, ax=ax); 
In [21]:
sns.jointplot(x="fare", y="age", data=titanic, s=40, edgecolor="w", linewidth=1)

You may control the seaborn Figure aesthetics.

In [22]:
# matplotlib
fig, ax = plt.subplots(1,1, figsize=(10,6))
ax.scatter(heart_df['age'], heart_df['restbp'], alpha=0.8);
ax.set_xlabel('Age (yrs)', fontsize=15);  # plain text keeps the space that math mode would drop
ax.set_ylabel(r'Resting Blood Pressure (mmHg)', fontsize=15);
ax.set_title('Age vs. Resting Blood Pressure', fontsize=14)
plt.show();

Plotting the distribution of three variables

In [23]:
f, ax = plt.subplots(1,1, figsize=(10, 5))
sns.scatterplot(x="fare", y="age", hue="survived", data=titanic, ax=ax);

Plotting the distribution of four variables (going too far?)

Exercise: Plot the distribution of fare paid by passengers according to age, survival and sex.

Use size= for the fourth variable

In [24]:
# your code here
f, ax = plt.subplots(1,1, figsize=(10, 5))
sns.scatterplot(x="fare", y="age", hue="survived", size="sex", data=titanic, ax=ax);

Pairplots

In [25]:
titanic.columns
Out[25]:
Index(['survived', 'pclass', 'sex', 'age', 'sibsp', 'parch', 'fare', 'class', 'deck'], dtype='object')
In [26]:
to_plot = ['age', 'fare', 'survived', 'deck']
In [34]:
df_to_plot = titanic.loc[:,to_plot]
sns.pairplot(df_to_plot);
In [28]:
from pandas.plotting import scatter_matrix
scatter_matrix(df_to_plot, alpha=0.8, figsize=(10, 10), diagonal='kde');

Plotting Categorical Variables

In [37]:
titanic = sns.load_dataset('titanic')
f, ax = plt.subplots(figsize=(7, 3));
ax = sns.countplot(y="deck", data=titanic, color="c");
ax.set_title('Titanic');
In [38]:
ax = sns.countplot(x="class", data=titanic)
ax.set_title('Titanic');
In [39]:
fig, ax = plt.subplots(figsize=(10,6)) # Create figure object
sns.set_context("notebook", font_scale=1, rc={"lines.linewidth": 2.5})
ax = sns.countplot(x="deck", data=titanic)
In [40]:
sns.set(style="ticks", palette="muted")
sns.relplot(x="age", y="deck", col="class", data=titanic);
In [41]:
sns.set_context("notebook", font_scale=1.5, rc={"lines.linewidth": 2.5})
sns.pairplot(data=titanic, hue="deck");

Introduction to pandas plotting.

There is plotting functionality built into pandas. Look for it in the pandas "encyclopedia", a mere 2883-page PDF from the creator Wes McKinney: pandas documentation (pdf)

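Example: The value_counts() Series method counts how often each distinct value occurs in a 1D array of values, which is effectively a histogram for categorical data. A top-level pd.value_counts() function also exists for regular arrays, although recent pandas releases deprecate it in favor of the Series method.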
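
As a minimal sketch of value_counts(), the snippet below counts deck letters in a small hand-made Series. The Series `decks` and its values are hypothetical stand-ins for a column such as the Titanic `deck`; only `value_counts()` and its `normalize` argument come from pandas.

```python
import pandas as pd

# Hypothetical deck letters; None stands in for a missing value.
decks = pd.Series(['C', 'E', 'C', 'B', None, 'C', 'E'])

# Count occurrences of each distinct value (missing values are skipped by default).
counts = decks.value_counts()

# The same counts expressed as fractions of the non-missing entries.
shares = decks.value_counts(normalize=True)

print(counts)
print(shares)
```

The result is sorted from most to least frequent, so it can be passed straight to a bar plot.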

Reminder: DataFrame: “index” (axis=0, default), “columns” (axis=1)
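
To make the axis reminder concrete, here is a small sketch on a hypothetical two-column frame (`small`, with columns `a` and `b` invented for illustration). With axis=0 an aggregation runs down the index and yields one value per column; with axis=1 it runs across the columns and yields one value per row.

```python
import pandas as pd

# Hypothetical toy frame, used only to illustrate the axis argument.
small = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})

col_sums = small.sum(axis=0)   # collapse the index: one sum per column
row_sums = small.sum(axis=1)   # collapse the columns: one sum per row

print(col_sums)
print(row_sums)
```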


Line Graph

Good for showing time dependence, or how a variable evolves along an ordered index.

In [42]:
df = pd.DataFrame(np.random.randn(1000, 4), columns=['A', 'B', 'C', 'D'])
df.head()
Out[42]:
          A         B         C         D
0 -0.072986  0.064586  0.076005  1.768125
1 -1.007168  0.091050 -1.019906  0.741020
2 -0.418693 -1.280488  0.467859 -1.031090
3 -1.178062 -0.718033  0.317143 -1.531387
4  0.297648 -0.211252  0.718495  0.370736
In [43]:
# cumulative sum adds column values as it goes
df = df.cumsum()
df.head()
Out[43]:
          A         B         C         D
0 -0.072986  0.064586  0.076005  1.768125
1 -1.080154  0.155636 -0.943901  2.509145
2 -1.498847 -1.124852 -0.476042  1.478055
3 -2.676909 -1.842885 -0.158899 -0.053332
4 -2.379262 -2.054137  0.559596  0.317404
In [44]:
plt.figure();
df.plot();
plt.legend(loc='best');
In [45]:
ts = pd.Series(np.random.randn(1000),
               index=pd.date_range('1/1/2000', periods=1000))
df = pd.DataFrame(np.random.randn(1000, 4), 
                  index=ts.index, columns=list('ABCD'))

df = df.cumsum()
plt.figure();
df.plot();

Plotting methods allow for a handful of plot styles other than the default line plot. These methods can be provided as the kind keyword argument to plot(), and include:

  • ‘bar’ or ‘barh’ for bar plots
  • ‘hist’ for histogram
  • ‘box’ for boxplot
  • ‘kde’ or ‘density’ for density plots
  • ‘area’ for area plots
  • ‘scatter’ for scatter plots
  • ‘hexbin’ for hexagonal bin plots
  • ‘pie’ for pie plots

In addition to these kinds, there are the DataFrame.hist() and DataFrame.boxplot() methods, which use a separate interface. The scatter_matrix function in pandas.plotting takes a DataFrame as its argument.
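
The two ways of choosing a plot style can be sketched as follows. The frame `demo` and its seeded random values are hypothetical; the point is only that passing kind='bar' to plot() and calling the plot.bar() accessor draw the same chart.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 3 rows, 2 columns, seeded for reproducibility.
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.random((3, 2)), columns=['a', 'b'])

# Style selected through the kind keyword of plot().
ax_kind = demo.plot(kind='bar')

# The same style through the accessor method.
ax_accessor = demo.plot.bar()
```

Both calls return a matplotlib Axes, so titles and labels can be set on the result in the usual way.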


Bar Plots

In [46]:
plt.figure();
df.iloc[0].plot(kind='bar');
In [47]:
df2 = pd.DataFrame(np.random.rand(10, 4), columns=['a', 'b', 'c', 'd'])
df2
Out[47]:
          a         b         c         d
0  0.036437  0.028772  0.636406  0.252079
1  0.550359  0.806193  0.776958  0.408668
2  0.212565  0.949430  0.236970  0.336636
3  0.231369  0.899283  0.925506  0.750473
4  0.960434  0.217803  0.220513  0.541103
5  0.510726  0.459889  0.054106  0.230044
6  0.887885  0.284679  0.520790  0.455553
7  0.432802  0.437612  0.999108  0.604186
8  0.251041  0.253487  0.634895  0.679853
9  0.379598  0.809397  0.546982  0.347001
In [48]:
df2.plot.bar();
In [49]:
# horizontal bar plot
df2.plot.barh(stacked=False);

Histograms

In [50]:
df4 = pd.DataFrame({'a': np.random.randn(1000) + 1, 'b': np.random.randn(1000), 
                    'c': np.random.randn(1000) - 1}, columns=['a', 'b', 'c'])

plt.figure();
df4.plot.hist(alpha=0.5, stacked=False, bins=60);

Boxplots

In [51]:
color = {'boxes': 'DarkGreen', 'whiskers': 'DarkOrange',
         'medians': 'DarkBlue', 'caps': 'Gray'}

df = pd.DataFrame(np.random.rand(10, 5), columns=['A', 'B', 'C', 'D', 'E'])
df.plot.box(color=color );

Area plots

You can create area plots with Series.plot.area() and DataFrame.plot.area(). Area plots are stacked by default. To produce a stacked area plot, each column must contain either all positive or all negative values.
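
The sign requirement can be checked before plotting. The helper below, `can_stack`, is a hypothetical name introduced here for illustration and is not part of pandas; it simply tests whether every column is entirely non-negative or entirely non-positive.

```python
import pandas as pd

def can_stack(frame):
    """Return True if each column is all >= 0 or all <= 0 (hypothetical helper)."""
    all_nonneg = (frame >= 0).all()   # per-column: no negative values
    all_nonpos = (frame <= 0).all()   # per-column: no positive values
    return bool((all_nonneg | all_nonpos).all())

# Hypothetical toy frames: one safe to stack, one with mixed signs in column 'a'.
positive = pd.DataFrame({'a': [0.2, 0.5], 'b': [0.1, 0.9]})
mixed = pd.DataFrame({'a': [0.2, -0.5], 'b': [0.1, 0.9]})

print(can_stack(positive))
print(can_stack(mixed))
```

When the check fails, one option is to plot with stacked=False so that the areas overlap instead of being stacked.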

In [52]:
df = pd.DataFrame(np.random.rand(10, 4), columns=['a', 'b', 'c', 'd'])
df.plot.area(stacked=True);
In [53]:
df.plot.area(stacked=False);

Scatterplot

A scatter plot can be drawn with the DataFrame.plot.scatter() method. It requires numeric columns for the x and y axes, which are specified with the x and y keywords.

In [54]:
ax = df.plot.scatter(x='a', y='b', color='DarkBlue', label='Group 1');
df.plot.scatter(x='c', y='d', color='DarkGreen', label='Group 2', ax=ax);

pandas Tricks

The copy() method on pandas objects copies the underlying data (though not the axis indexes, since they are immutable) and returns a new object. Note that it is seldom necessary to copy objects. For example, there are only a handful of ways to alter a DataFrame in-place:

  • Inserting, deleting, or modifying a column.
  • Assigning to the index or columns attributes.
  • For homogeneous data, directly modifying the values via the values attribute or advanced indexing.

To be clear, almost every pandas method returns a new object and leaves the original untouched; the main exception is a method called with inplace=True, as in the .replace() calls later in this lab. If the data is modified, it is because you did so explicitly.
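
A short sketch of the difference between an alias, a copy, and a method result. The frame `original` and the column names are hypothetical; the behavior shown relies only on `copy()`, column assignment, and `sort_values()`.

```python
import pandas as pd

# Hypothetical toy frame.
original = pd.DataFrame({'age': [38, 22, 26]})

alias = original            # a second name for the same object
snapshot = original.copy()  # an independent copy of the data

# Inserting a column alters `original` in place.
original['age_next_year'] = original['age'] + 1

# A method call returns a new object and leaves `original` in its old order.
sorted_df = original.sort_values('age')

print(alias.columns.tolist())      # the alias sees the new column
print(snapshot.columns.tolist())   # the copy does not
print(original['age'].tolist())    # still unsorted
```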

4 - Group Exercise: 1/2 hour in the Life of a Cardiologist

Try each exercise on your own and then discuss with your peers sitting at your table. Feel free to engage the TFs and instructors as well.

Visualize and explore the data. Use .describe() to look at your data and also examine if you have any missing values.
What is the actual number of feature variables after converting categorical variables to dummy ones?

List of available variables (includes target variable num):

  • age: continuous
  • sex: categorical, 2 values {0: female, 1: male}
  • cp (chest pain type): categorical, 4 values {1: typical angina, 2: atypical angina, 3: non-angina, 4: asymptomatic angina}
  • restbp (resting blood pressure on admission to hospital): continuous (mmHg)
  • chol (serum cholesterol level): continuous (mg/dl)
  • fbs (fasting blood sugar): categorical, 2 values {0: <= 120 mg/dl, 1: > 120 mg/dl}
  • restecg (resting electrocardiography): categorical, 3 values {0: normal, 1: ST-T wave abnormality, 2: left ventricular hypertrophy}
  • thalach (maximum heart rate achieved): continuous
  • exang (exercise induced angina): categorical, 2 values {0: no, 1: yes}
  • oldpeak (ST depression induced by exercise relative to rest): continuous
  • slope (slope of peak exercise ST segment): categorical, 3 values {1: upsloping, 2: flat, 3: downsloping}
  • ca (number of major vessels colored by fluoroscopy): discrete (0,1,2,3)
  • thal: categorical, 3 values {3: normal, 6: fixed defect, 7: reversible defect}
  • num (diagnosis of heart disease): categorical, 5 values {0: less than 50% narrowing in any major vessel, 1-4: more than 50% narrowing in 1-4 vessels}
In [55]:
# load the dataset
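# column names in the order of the variable list above
# (added on the assumption that `columns` is not defined earlier in the notebook)
columns = ["age", "sex", "cp", "restbp", "chol", "fbs", "restecg",
           "thalach", "exang", "oldpeak", "slope", "ca", "thal", "num"]
heart_df = pd.read_csv('../data/heart_disease.csv', header=None, names=columns)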
heart_df.head()
Out[55]:
    age  sex   cp  restbp   chol  fbs  restecg  thalach  exang  oldpeak  slope   ca  thal  num
0  63.0  1.0  1.0   145.0  233.0  1.0      2.0    150.0    0.0      2.3    3.0  0.0   6.0  0.0
1  67.0  1.0  4.0   160.0  286.0  0.0      2.0    108.0    1.0      1.5    2.0  3.0   3.0  2.0
2  67.0  1.0  4.0   120.0  229.0  0.0      2.0    129.0    1.0      2.6    2.0  2.0   7.0  1.0
3  37.0  1.0  3.0   130.0  250.0  0.0      0.0    187.0    0.0      3.5    3.0  0.0   3.0  0.0
4  41.0  0.0  2.0   130.0  204.0  0.0      2.0    172.0    0.0      1.4    1.0  0.0   3.0  0.0

Answer the following questions using plots:

  1. At what ages do people seek cardiological exams?
  2. Do men seek help more than women?
  3. Examine the variables. How do they relate to one another?
  4. (Variation on 02): What % of men and women seek cardio exams?
  5. Does resting blood pressure increase with age?

Pandas trick: .replace

The response variable (num) is categorical with 5 values, but we don't have enough data to predict all the categories.
Therefore we'll replace num with hd (heart disease): categorical, 2 values {0: no, 1: yes}.
Use the code below (take a minute to understand how it works; it's very useful!):

In [56]:
# Replace response variable values with a binary response (1: heart disease(hd) or 0: not)
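# assign the result back; an inplace replace on a selected column may not update heart_df in recent pandas
heart_df['num'] = heart_df['num'].replace(to_replace=[1,2,3,4], value=1)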

# Rename column for clarity
heart_df = heart_df.rename(columns = {'num':'hd'})
heart_df.head()
Out[56]:
    age  sex   cp  restbp   chol  fbs  restecg  thalach  exang  oldpeak  slope   ca  thal   hd
0  63.0  1.0  1.0   145.0  233.0  1.0      2.0    150.0    0.0      2.3    3.0  0.0   6.0  0.0
1  67.0  1.0  4.0   160.0  286.0  0.0      2.0    108.0    1.0      1.5    2.0  3.0   3.0  1.0
2  67.0  1.0  4.0   120.0  229.0  0.0      2.0    129.0    1.0      2.6    2.0  2.0   7.0  1.0
3  37.0  1.0  3.0   130.0  250.0  0.0      0.0    187.0    0.0      3.5    3.0  0.0   3.0  0.0
4  41.0  0.0  2.0   130.0  204.0  0.0      2.0    172.0    0.0      1.4    1.0  0.0   3.0  0.0
In [57]:
# look at the features
heart_df.info();
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 299 entries, 0 to 298
Data columns (total 14 columns):
age        299 non-null float64
sex        299 non-null float64
cp         299 non-null float64
restbp     299 non-null float64
chol       299 non-null float64
fbs        299 non-null float64
restecg    299 non-null float64
thalach    299 non-null float64
exang      299 non-null float64
oldpeak    299 non-null float64
slope      299 non-null float64
ca         299 non-null float64
thal       299 non-null float64
hd         299 non-null float64
dtypes: float64(14)
memory usage: 32.8 KB
In [58]:
heart_df.describe()
Out[58]:
             age        sex         cp     restbp       chol        fbs    restecg    thalach      exang    oldpeak      slope         ca       thal         hd
count 299.000000  299.00000 299.000000 299.000000 299.000000 299.000000 299.000000 299.000000 299.000000 299.000000 299.000000 299.000000 299.000000 299.000000
mean   54.521739    0.67893   3.163880 131.715719 246.785953   0.143813   0.989967 149.327759   0.331104   1.058528   1.605351   0.672241   4.745819   0.464883
std     9.030264    0.46767   0.964069  17.747751  52.532582   0.351488   0.994903  23.121062   0.471399   1.162769   0.616962   0.937438   1.940977   0.499601
min    29.000000    0.00000   1.000000  94.000000 100.000000   0.000000   0.000000  71.000000   0.000000   0.000000   1.000000   0.000000   3.000000   0.000000
25%    48.000000    0.00000   3.000000 120.000000 211.000000   0.000000   0.000000 132.500000   0.000000   0.000000   1.000000   0.000000   3.000000   0.000000
50%    56.000000    1.00000   3.000000 130.000000 242.000000   0.000000   1.000000 152.000000   0.000000   0.800000   2.000000   0.000000   3.000000   0.000000
75%    61.000000    1.00000   4.000000 140.000000 275.500000   0.000000   2.000000 165.500000   1.000000   1.600000   2.000000   1.000000   7.000000   1.000000
max    77.000000    1.00000   4.000000 200.000000 564.000000   1.000000   2.000000 202.000000   1.000000   6.200000   3.000000   3.000000   7.000000   1.000000

At this point you should split the data into a training set and a test set, and work only on the training set! For simplicity we will not do this in the solutions.
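
A minimal sketch of such a split, assuming scikit-learn's train_test_split (already imported at the top of the lab). The frame `toy_df`, the 80/20 ratio, and the random_state value are illustrative choices and are not taken from the lab solutions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for heart_df: an age column and a binary response.
rng = np.random.default_rng(0)
toy_df = pd.DataFrame({'age': rng.integers(29, 78, size=100),
                       'hd': rng.integers(0, 2, size=100)})

# Hold out 20% of the rows; stratify keeps the hd proportions similar in both parts.
train_df, test_df = train_test_split(toy_df, test_size=0.2,
                                     random_state=109, stratify=toy_df['hd'])

print(train_df.shape, test_df.shape)
```

All exploratory plots would then be drawn from train_df only, so that nothing learned from the test rows leaks into later modeling decisions.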

In [59]:
# your code here
# 01. what ages do people seek cardiological exams? 
In [60]:
# %load solutions/q01.py
fig, ax = plt.subplots(figsize=(8,6)) 
sns.set_context("notebook", font_scale=1.5, rc={"lines.linewidth": 2.5})
ax = sns.histplot(heart_df.age, kde=False, ax=ax) # histplot supersedes the deprecated distplot (assumes seaborn >= 0.11)
ax.set_xlim(0, 90);
ax.set_title('Ages seeking cardio exams');
#ax.set_xlabel('age of patient')
In [61]:
# your code here
# 02. do men seek help more than women?
In [62]:
# %load solutions/q02.py
heart_df.replace({'sex': {0.: 'F', 1.: 'M'}}, inplace=True)  
# We would use a countplot
ax = sns.countplot(x="sex", data=heart_df)
ax.set_title('Count of M vs. F who seek cardio examinations');
In [63]:
heart_df.replace({'sex': {'F': 0., 'M': 1.}}, inplace=True)

The number of feature variables (after converting categorical variables to dummy ones) is: 1 (age) + 1 (sex) + 3 (cp) + 1 (restbp) + 1 (chol) + 1 (fbs) + 2 (restecg) + 1 (thalach) + 1 (exang) + 1 (oldpeak) + 2 (slope) + 1 (ca) + 2 (thal) = 18
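
The count above keeps one fewer dummy column than each categorical variable has levels. A minimal sketch of that convention with pd.get_dummies and drop_first=True follows; the frame `toy` holds only a few hypothetical rows, and just two of the categorical columns (cp with 4 levels, thal with 3) are encoded to keep the example short.

```python
import pandas as pd

# Hypothetical rows covering every level of cp (1-4) and thal (3, 6, 7).
toy = pd.DataFrame({'age':  [63.0, 67.0, 37.0, 41.0],
                    'cp':   [1.0, 4.0, 3.0, 2.0],
                    'thal': [6.0, 3.0, 7.0, 3.0]})

# drop_first=True drops one level per variable, so k levels become k-1 columns.
dummies = pd.get_dummies(toy, columns=['cp', 'thal'], drop_first=True)

print(dummies.columns.tolist())
print(dummies.shape[1])   # 1 (age) + 3 (cp) + 2 (thal)
```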

In [64]:
# your code here
# 03. Examine the variables. How do they relate to one another?
In [65]:
# %load solutions/q03.py
categorical = ["sex", "cp", "fbs", "restecg", "exang",  "slope", "ca", "thal", "hd"]
numerical = ["age","restbp", "chol", "thalach",  "oldpeak"]

# pandas trick: give me all rows of numerical columns
sns.set_context("notebook", font_scale=1, rc={"lines.linewidth": 2.5})
df_to_plot = heart_df.loc[:,numerical]
sns.pairplot(df_to_plot);

plt.show()

# Look at correlation coefficients too
corr_matrix = heart_df.corr()
corr_matrix['hd'].sort_values(ascending=False)
Out[65]:
hd         1.000000
thal       0.530603
ca         0.455398
exang      0.427123
oldpeak    0.424947
cp         0.412597
slope      0.335926
sex        0.281912
age        0.223498
restecg    0.157941
restbp     0.153849
chol       0.067350
fbs        0.000192
thalach   -0.430108
Name: hd, dtype: float64
In [66]:
# your code here
# 04. What percentage of men and women seek cardio exams? 
In [69]:
# %load solutions/q04.py
# first find percentages
per_men = (heart_df.sex.value_counts()[1])/(heart_df.sex.value_counts()[0]+heart_df.sex.value_counts()[1])
per_wom = (heart_df.sex.value_counts()[0])/(heart_df.sex.value_counts()[0]+heart_df.sex.value_counts()[1])
per_men, per_wom

labels = 'Men', 'Women'
explode = (0, 0.1)  # only "explode" the 2nd slice 
sizes = [per_men, per_wom]

# First and last time I will use a pie chart, let alone an exploding one!!
fig1, ax1 = plt.subplots()
ax1.pie(sizes, explode=explode, labels=labels, autopct='%1.1f%%', shadow=True, startangle=90)
ax1.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.
plt.show()
In [70]:
# your code here
# 05. Does resting blood pressure increase with age?
In [71]:
# %load solutions/q05.py
fig, ax = plt.subplots(figsize=(20,6)) 
ax = sns.boxplot(x="age", y="restbp", data=heart_df)
ax.set_xlabel('age', fontsize=14);
ax.set_ylabel('restbp (mmHg)', fontsize=14);
ax.set_title('Percentile Distribution for age and rest blood pressure', fontsize=14);

Bonus: Find the hidden pattern

Read the following file into a pandas DataFrame: '../data/mystery.csv' and plot it. How does it look? You should see a beautiful pattern. If not, think of ways to fix the issue.

In [ ]:
mystery = pd.read_csv('../data/mystery.csv',  sep=' ', header=None) 
mystery.head()
In [ ]:
# your code here
In [ ]:
# this solution will be revealed in the next lab
# %load solutions/mystery.py